English to Persian Transliteration
Abstract
Persian is an Indo-European language written using Arabic script, and is an official language of Iran, Afghanistan, and Tajikistan. Transliteration of Persian to English, that is, the character-by-character mapping of a Persian word that is not readily available in a bilingual dictionary, is an unstudied problem. In this paper we make three novel contributions. First, we present performance comparisons of existing grapheme-based transliteration methods on English to Persian. Second, we discuss the difficulties in establishing a corpus for studying transliteration. Finally, we introduce a new model of Persian that takes into account the habit of shortening, or even omitting, runs of English vowels. This trait makes transliteration of Persian particularly difficult for phonetic-based methods. The new model outperforms the existing grapheme-based methods on Persian, exhibiting a 24% relative increase in transliteration accuracy measured using the top-5 criterion.
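The abstract reports its result as a relative increase in top-5 accuracy without defining either term. The following is a minimal sketch of how such figures are commonly computed, assuming that a source word counts as correct when any of its first five ranked candidates matches an accepted reference, and that relative increase means (new - old) / old. The function names, the placeholder strings, and the numbers are hypothetical and are not taken from the paper.

```python
def top_k_accuracy(predictions, references, k=5):
    """Fraction of source words with an accepted reference among the first k candidates.

    predictions: dict mapping a source word to its ranked list of candidates.
    references:  dict mapping a source word to the set of accepted transliterations.
    """
    if not predictions:
        raise ValueError("no predictions to score")
    hits = 0
    for word, candidates in predictions.items():
        accepted = references.get(word, set())
        # A word is a hit if any of its top-k candidates is an accepted reference.
        if any(candidate in accepted for candidate in candidates[:k]):
            hits += 1
    return hits / len(predictions)


def relative_increase(baseline, improved):
    """Relative change of `improved` over `baseline`, e.g. 0.24 for a 24% increase."""
    if baseline == 0:
        raise ValueError("baseline accuracy must be non-zero")
    return (improved - baseline) / baseline


# Toy data with placeholder strings (not real transliterations).
toy_predictions = {"w1": ["a", "b", "c"], "w2": ["x", "y"]}
toy_references = {"w1": {"c"}, "w2": {"z"}}

toy_top5 = top_k_accuracy(toy_predictions, toy_references, k=5)
toy_top2 = top_k_accuracy(toy_predictions, toy_references, k=2)
```

Under this reading, a 24% relative increase would correspond, for example, to a hypothetical move from 0.50 to 0.62 top-5 accuracy, which is a 12-point absolute gain.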
Similar resources
Collapsed Consonant and Vowel Models: New Approaches for English-Persian Transliteration and Back-Transliteration
We propose a novel algorithm for English to Persian transliteration. Previous methods proposed for this language pair apply a word alignment tool for training. By contrast, we introduce an alignment algorithm particularly designed for transliteration. Our new model improves the English to Persian transliteration accuracy by 14% over an n-gram baseline. We also propose a novel back-transliterati...
The Amirkabir Machine Transliteration System for NEWS 2011: Farsi-to-English Task
In this paper we describe the statistical machine transliteration system of Amirkabir University of Technology, developed for the NEWS 2011 shared task. This year we participated in the English to Persian language pair. We use three systems for transliteration: the first system is a maximum entropy model with a new proposed alignment algorithm. The second system is Sequitur g2p tool, an open source gra...
Character Sequence Modeling for Transliteration
Character Sequence Modeling (CSM), typically called language modeling, has not received sufficient attention in current transliteration research. We discuss the impact of various CSM factors like word granularity, smoothing technique, corpus variation, and word origin on the transliteration accuracy. We demonstrate the importance of CSM by showing that for transliterating into Engli...
Syllable Based Transcription of English Words into Perso-Arabic Writing System
This paper presents a rule-based method for transcription of English words into the Perso-Arabic orthography. The method relies on a phonetic representation of English words, such as that of the CMU pronunciation dictionary. Some of the challenging problems are the context-based vowel representation in the Perso-Arabic writing system and the mismatch between the syllabic structures of English and Persi...
An Unsupervised Alignment Model for Sequence Labeling: Application to Name Transliteration
In this paper a new sequence alignment model is proposed for name transliteration systems. In addition, several new features are introduced to enhance the overall accuracy in a name transliteration system. Discriminative methods are used to train the model. Using this model, we achieve improvements on the transliteration accuracy in comparison with the state-of-the-art alignment models. The 1be...